Solving Cyclic Longest Common Subsequence in Quadratic Time 

Andy Nguyen* 
August 17, 2012 



Abstract 

We present a practical algorithm for the cyclic longest common subsequence (CLCS) problem that runs in 0(mn) 
time, where m and n are the lengths of the two input strings. While this is not necessarily an asymptotic improvement 
over the existing record, it is far simpler to understand and to implement. 

1 Introduction 

The longest common subsequence (LCS) problem is a classic problem whose myriad extensions are applicable to 
many fields ranging from string matching for DNA sequence alignment to dynamic time warping for shape distance. 
While the original problem considers strings with a beginning and an end as input, a natural extension is to consider 
cyclic strings, where the starting point is not fixed on either string. This extension is useful for matching such objects 
as DNA plasmids and closed curves. In this paper we present a practical algorithm for the cyclic longest common 
subsequence {CLCS) problem that runs in 0(mn) time, where m and n are the lengths of the two input strings. 

2 Related Work 

The state-of-the-art algorithm for regular LCS comes from [4], who apply the standard dynamic programming algo- 
rithm with the "Four Russians" speedup [lj to achieve a running time of 0(n 2 /\gn). There are also algorithms (cite) 
that achieve better runtimes depending on the input, but all of these algorithms are asymptotically slower in the worst 
case. 

The cyclic version of LCS has seen more recent improvements. The first major improvement comes from ||3l , 
which uses a divide-and-conquer strategy to solve the problem in 0(mnlgm) runtime. Since then, this approach 
has been applied to generalizations to the problem, and practical optimizations have been discovered; see for 
example. We describe the setup used in these papers in our preliminaries, though we build a different approach on 
top of this setup to arrive at our new algorithm. f2|j provides a solution that runs in 0(n 2 ) time on arbitrary inputs, 
with asymptotically better performance on similar inputs; however, this solution requires considerable amounts of 
machinery both to understand and to implement. 

3 Preliminaries 

Given a string A of length n and an integer k, < k < n, we can define cut (A, k) = substring(A,k,n) + substring(A,Q,k). 
For example, CM/("abcd",2) = "cd" + "ab" = "cdab", while cM/("abcd",0) = "abcd" + "" = "abed". We can 
extend this definition to any integer k by defining cut(A,k) — cut(A,k mod n) when k < or k > n. 
Let us define the following: 

Definition 1. Let A and B be strings of length m and n respectively, where m <n. A cyclic longest common subse- 
quence of A and B CLCS(A 1 B) is a string that satisfies the following properties: 
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1. CLCS(A,B) = LCS(cut(A,i) 1 cut(B 1 j)) for some integers i and j. 

2. \CLCS(A.B)\ > \LCS(cut(A,i) 1 cut(B 1 j))\ for all integers i and j. 

We can first observe that CLCS is no easier than LCS. To see this, we observe that we can solve LCS(A,B) using 
CLCS by prepending A and B with m x's, where x is a symbol not found in either string. This forces CLCS to keep A 
and B aligned with each other to match all the x's, allowing us to extract LCS(A,B) from the result. 

On the other hand, we can also show that CLCS is not much harder than LCS. We could solve CLCS by solving 
mn instances of LCS, corresponding to each possible pair of cuts of the strings A and B. This implies that we have a 
solution to CLCS that runs in 0(m 2 n 2 ) time. In fact, we can do even better: 

Lemma 1. There is some cyclic longest common subsequence of A and B CLCS(A,B) that satisfies the following 
properties: 

1. CLCS(A,B) = LCS(cut(A,k),B) for some integer k. 

2. \CLCS(A,B)\ > \LCS(cut(A,i),cut(B 1 j))\ for all integers i and j. 

Proof Given in 0. □ 

This lemma allows us to run only m instances of LCS, giving us a runtime of 0(m 2 n). Now let's see if we can do 
better. To do so, let's take a short detour to look at regular LCS. Recall that we can solve LCS(A,B) in 0(mn) time 
with the use of a two-dimensional dynamic programming (DP) table that stores lengths len(i,j) and parent pointers 
parent(i,j). Normally we look at this table as a two-dimensional array; however, we can also look at this table as a 
directed grid graph Ga,b as follows (Figure[Tji: 
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Figure 1 : Visualizing the DP table as a grid graph. 



• For each entry <?,, in the table we have a single node v. 

• For each adjacent pair of entries e u and e v in the table where e„ is considered when computing e v , we have a 
directed edge from u to v. Note that all edges point down and/or to the right, so the graph is acyclic. We see 
that all of the horizontal and vertical edges are present in the graph, and in addition we have diagonal edges 
wherever the corresponding characters in the two strings match. 

For our convenience, we will fix the orientation of the graph as shown in Figure [T| namely, that (0, 0) is the top 
left and (m,n) is the bottom right. In this orientation, the first coordinate denotes the row, and the second coordinate 
denotes the column. 

Lemma 2. Finding an LCS(A,B) is equivalent to finding a shortest path from (0,0) to (m,n) in Ga,b- 

Proof Given in 0. □ 
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Corollary 1. The set of parent pointers computed by any DP solution to LCS(A,B) (when treated as undirected edges) 
forms a shortest path tree of Ga,b rooted at (0, 0). 

Note that this holds true regardless of the choice of parent pointer in case of ties. 

This means that we can solve DCS by solving a shortest path problem on a directed acyclic graph with unit edge 
lengths. 

4 Algorithm 

According to Lemma 2, we can solve CDCS by solving m shortest-path problems, one for each cut of A. These shortest- 
path problems are closely related, as we see when we construct the graph Gaa.b (FigureEJ. Then the shortest paths we 
want to find are the ones from (0,0) to (m,n), from (1,0) to (m + l,n), . . ., and from (m — 1,0) to (2m — l,n). Now, 
we haven't done anything that will save us computation time; it will still cost us 0(m 2 n) time to find these m shortest 
paths. However, the hope is that since these paths all live on the same graph, we can reuse some work between shortest 
path computations. 
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Figure 2: Constructing the graph Gaa,b- 

As we have fixed the orientation of the graph, we can call one edge lower than another with respect to this 
orientation, and we can do so similarly for paths. We then define the following: 

Definition 2. A lowest shortest path tree of Ga.b is a shortest path tree where for every node v, the path from the root 
to v contained in the shortest path tree is lower than any other shortest path from the root to the node. 

First, we show how to compute a lowest shortest path tree efficiently. 

Lemma 3. The set of parent pointers computed by the DP solution that favors 4— first, then \, and finally f in case 
of ties forms a lowest shortest path tree of Ga.b rooted at (0, 0). 
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Proof. Suppose for contradiction that there is some node v where there is a lower shortest path p' from (0,0) to v than 
the one p contained in the shortest path tree. Beginning at v and tracing back towards (0,0), p' and p must eventually 
diverge at some node w. This divergence means that p' and p follow two different valid parent pointers. But since p 
follows the parent pointers that favor 4— first, then \, and finally f, it follows that p' cannot be lower than p. □ 

This lemma implies that maintaining the DP tiebreaking policy locally will yield a lowest shortest path tree glob- 
ally. We'll use this to our advantage to bound the work we need to do to extract all m shortest paths. 

If we compute the lowest shortest path tree of Gaa,b, we can immediately read off the shortest path from (0,0) to 
(m,n). Before we can read the shortest path from (1,0) to (m + l,n), we need to re-root the lowest shortest path tree 
at (1,0). We can see that this is equivalent to removing the top row entirely, and then fixing the remainder of the tree 
to restore its properties. 

Lemma 4. If we remove the top row of nodes from the lowest shortest path tree rooted at (0,0), we are left with at 
most two subtrees L and (possibly) R. 

Proof. We show this by showing there are at most two edges between the top row and the rest of the tree. The vertical 
edge on the far left, from (0,0) to (1,0), must clearly be in the tree, and we denote the subtree rooted at (1,0) to be 
L. There cannot be any other vertical edges, since the vertical edge on the far left provides us with a lower path to any 
node in the second row. Now consider the first diagonal edge, scanning from left to right, from (0,k) to (l,k+ 1). We 
denote the subtree rooted at (l,k+ 1) to be R. By similar reasoning, there cannot be any other diagonal edges to the 
right of this one. □ 

Note that because of the tiebreaking rule we use for the lowest shortest path tree, any node in R that is adjacent to 
a node in L (to the left or diagonally up and left; we call these nodes in R boundary nodes) chose its parent to be in R 
because that choice was strictly better than the parent option in L. 

We next note the following about the length entries in the DP table corresponding to LCS(AA,B) (recall that these 
are lengths of the common subsequence, not lengths of the paths in the graph): 

Lemma 5. For any entry (i, j) in the table, < len(i, j) — len(i, j — 1 ) < 1. 

Proof. Clearly len(ij) > len(ij — 1). Now suppose for contradiction that len(ij) — len(i,j — 1) > 1. We trace back 
along the parent pointers starting from (i,j) until we enter column j —I. The moment we do so, we find an entry 
where the length is either len(i,j) if we followed a <— or len(i,j) — 1 if we followed a \. Either way, this entry is 
above — 1) in the same column, yet strictly larger, so we could have filled this entry straight down to — 1), 
which violates the optimality of len(ij — 1). □ 

This lemma combined with our earlier observation about L and R leads us to the following key fact: 

Corollary 2. Let (i,j) be such that (i,j) € R but (i,j — 1) S L. Then len(i,j) = len(i,j — 1) + 1. 

Now, if we were to reconnect the two subtrees by setting parent (1, k+ 1) =4—, we would lower the length values 
of all nodes in R by exactly 1 (since we lose the diagonal of parent (l,k+ 1)). But since every boundary node's old 
parent choice was strictly better than the option in L, lowering the length values of all nodes in R does not result in 
any inconsistencies in the length table, which means we have a shortest path tree. This is not necessarily a lowest 
shortest path tree; however, Corollary 2 tells us that after this reconnection, every boundary node (i,j) G R satisfies 
len(i,j) — len(i,j — 1), which means in a lowest shortest path tree, parent(i,j) =■<— . Furthermore, every node in L 
has its length unchanged, and none of its potential parents gained length, so its parent pointer is unchanged. Similarly, 
every non-boundary node in R loses exactly one unit of length, as do all of its potential parents, so its parent pointer is 
also unchanged. 

This means if we set the parent of every boundary node to be <— , we now have a lowest shortest path tree, rooted 
at (1,0). Note that we can walk along the graph to find all boundary nodes by moving down or right (or both), which 
means we can perform this rerooting process in 0(m + n) = 0(n) time. The pseudocode is given below, where we 
assume that the input tree is rooted at (root — 1,0). 



4 



RE-RoOT(root,m,n, parent) 
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With this process in place, we can now extract the m shortest paths from Gaa.b by computing the lowest shortest 
path tree of Gaa.b, and then alternating between reading the shortest path from the root to the appropriate end node 
and re-rooting the lowest shortest path tree one node lower. We only need to do so m times, which means the entire 
process runs in 0{mn) time. The pseudocode is given below. 

CLCS(A,B) 



1 m=A.length 

2 n=B. length 

3 let len [0 ... 2m, ... n] and parent [0 ... 2m, ... n] be new tables 

4 LCS(A4,B, len, parent) 

5 S = TRACE-LCS(m, n, A, parent) 

6 for i = 1 to m — 1 

7 RE-ROOT(i, m,n, parent) 

8 S = TRACE-LCS(m + i,n,A, parent) 

9 if S. length < S. length 

10 s = s 

1 1 return S 



5 Implementation 

If we need to return the subsequence string, we must follow the pseudocode above. If we only need to return the string's 
length, we can skip the traceback subroutine and instead simply read off the entry in the length table corresponding to 
the start of the traceback. In this case, we need to update the far edge of the length table after each reroot, decrementing 
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the lengths by 1. In addition to simplifying the code, this change reduces the cache inefficiency of the algorithm, 
resulting in faster runtimes in practice. 

It is worth noting that implementing this algorithm requires far less effort than either the 0(mnlgm) algorithm 
given in Q or the 0(n 2 ) algorithm given in 

6 Conclusions 

We have presented a practical algorithm that solves CLCS in 0(mn) time. Note that the proof of correctness relies 
heavily on the fact that all edges in the graph have weight 1 . This means that this algorithm does not naturally apply 
to the usual extensions of the problem, such as edit distance and dynamic time warping. It may be worth investigating 
whether the algorithm can be adapted to certain classes of edge weights. 
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